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Abstract 

Recent approaches to the problem of inferring a continuous probability dis- 
tribution from a finite set of data have used a scalar field theory for the form 
of the prior probability distribution. This letter presents a more general form 
for the prior distribution that has a geometrical interpretation which is useful 
for tailoring prior distributions to the needs of each application. Examples are 
presented that demonstrate some of the capabilities of this approach, includ- 
ing the applicability of this approach to problems of more than one dimension. 
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Inferring the continuous probability distribution, or target distribution, from which a 
finite number of data samples were drawn is an example of an ill-posed inverse problem: 
there are many different distributions that could have produced the given finite data. Often 
one has prior information, separate from the data itself, that can reduce the range of pos- 
sible target distributions. More generally, one can assign a prior probability to each target 
distribution based on the prior information. Combining this prior probability distribution 
with the likelihood of the data given any particular target distribution, using Bayes' rule 
of probability, produces a posterior probability over the space of target distributions. This 
posterior distribution encapsulates all the information available, both from the data and 
from the prior information, and can be used to make probabilistic inferences. 

Let P[Q\x\, . . . ,xn] denote the posterior probability that the target distribution Q(x) 
describes the data x\, . . . , Xn- By Bayes' rule, 

P[xt, . . . ,x N \ 
Q{xi)---Q{x N )P[Q] 



JVQQ(x 1 )---Q(x N )P[Q] 



where P[Q] is the prior probability of the target distribution Q. 

The form for P[Q] should incorporate the available prior information. For example, by 
setting Q(x) = ip 2 (x) [Q, where ip may take any value in (—00, 00), we may insure that Q 
is non-negative, ip is referred to as the amplitude by analogy with quantum mechanics 0. 
A particular form for P[Q], or rather P[ip], that has been presented in order to, the authors 
say, incorporate a bias that Q be "smooth" is 



1 

— exp 



5(1- dxip 2 ), (3) 



where Z is the normalization factor and £ is a constant which controls the penalty applied 
to gradients. The delta function enforces normalization of the distribution Q. 

Because this particular prior distribution is not very effective at generating smooth dis- 
tributions (as will be shown) and because the prior information available for each problem 
will vary, it is useful to consider a more general form for the prior distribution. A more 
general approach is to define the prior distribution as 



P[ip] = -exp 



<f (!-(#)), (4) 



where V is a positive, symmetric (Hermitian) operator within whatever Hilbert space is cho- 
sen for This distribution is a generalization of a multi-dimensional Gaussian distribution 
with V acting as the covariance operator. Continuing this analogy, we write 

K(x,y)=a(x)a(y)p(x,y) (5) 

where cx 2 (x) is the variance at x and p is the correlation function. Information about 
smoothness is encoded in the correlation function. For example, if the distribution from 
which the {xi} were drawn is expected to be smooth over distances smaller than a certain 
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spatial scale then the correlation function should be near unity over distances smaller than 
this scale. The prior distribution used in |||| is equivalent to the one presented here in one 
dimension with V' 1 = —£ 2 d 2 , assuming ip goes to zero at ±00. 

It is useful to consider this prior probability distribution in geometrical terms. The 
eigenfunctions of the the operator V form a basis for the space of ip. The normalization 
constraint restricts ip to lie on a hyper-spherical surface of radius one. Those eigenfunctions 
with larger eigenvalues are more likely, a priori. If V has any eigenvalues that are zero then 
the corresponding eigenfunctions form a basis for a subspace that is orthogonal to i/j; that is 
the prior distribution prevents i[) from having any components along these eigenfunctions. 

With this form for the prior distribution the probability P[Q\xi, . . . , x^) of a distribution 
Q given the data is 

P[^|a;i, ...,x N ]cc ip 2 (x{) • ■ ■ ip 2 {x N ) 



x exp 



<y(i-<V#» (6) 



= e -^(l-(V#», (7) 
where the effective action S is 

SM = \^\V- 1 \^)-2^n((x i m- (8) 

" i 

The most likely distribution given the data is that function ip c \ which minimizes the ef- 
fective action subject to the normalization constraint. To enforce this constraint a Lagrange 
multiplier term A(l — (ip\ip))/2 is subtracted from the action. Variational methods then lead 
to the following equations for ip c \ and A: 

IV;cl) = 2 ? (9a) 

(Veiled) = 1. (9b) 
The solution to these equations may be written 

M = Yla i U(X)\x i ), (10) 

i 

where U(\) = (V' 1 + XI)' 1 . Eqs. © imply 

a i a j( x i\U(\)\xj) = 2, i = l,...,N (Ha) 

j 

y £a i a j (xi\U*(\)\x j ) = l. (lib) 

These + 1 non-linear equations determine A and the and may be solved using Newton's 
method M. 
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FIG. 1. The most likely distributions from an inverse Laplacian prior distribution with I = 6 
and from N = 20 (dashed line) and N = 1000 (dotted line) data drawn randomly from a target 
distribution consisting of the sum of two Normal distributions (solid curve). 

The covariance operator V in the prior distribution should be chosen for each different 
probability distribution that one is estimating. A few examples with three different forms 
for V are described below in order to illustrate the effects that different choices of V can 
have. First consider the case used in 0|| in which the prior covariance operator is an inverse 
Laplacian in one dimension, V' 1 = — £ 2 <9^. In this case 

U(X) = (-fdl + Xiy l (12) 

which is the Green's function of the modified Helmholtz equation. The solutions of this 
equation are well known, even for dimensions larger than one |J. In particular, in one 
dimension the most likely solution ip c \{x) is, from Eq. [10| 

^d(aO = Yl a iyJ 2 ex "P(- k \ x - x i\) ( 13 ) 

i 

where k = \f\jt. Examples of the most likely probability distributions for this case with 
I = 6 are shown in Fig. [1|. For these examples the data were drawn from a target distribution 
consisting of the sum of two Normal distributions, shown as the solid curve in the figure. 
The most likely distributions are not very smooth, as would be expected from the functional 
form of Eq. [13|. 

For the second example consider the case in which the prior covariance operator has a 
correlation function which is a Gaussian, 



V(x, y; r) = cr 2 exp 



-(*-y) : 



2r" 



(14) 
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FIG. 2. The most likely distributions from a Sine function prior distribution with ko = 3.33 
and from the same N = 20 (dashed line) and N = 1000 (dotted line) data used for the examples in 
Fig. |l], which were drawn randomly from a target distribution consisting of the sum of two Normal 
distributions (solid curve). 

Here a 2 is the prior variance for the magnitude of the target probability distribution and r is 
a correlation scale below which the target probability distribution is believed to be smooth. 
In this case it is useful to expand U in an operator product expansion in V, 

U(X) = V (l - XV + X 2 V-V-X 3 V-V-V + ■■■). (15) 

Because V ■ V oc V(x, y; \p2r) for this particular V, Eq. generates a multi-resolution 
expansion, analogous to a wavelet expansion, for U and therefore also for ip c \ consisting of 
Gaussians of ever increasing width, increasing each step by a factor of \/2 with the finest 
scale being represented by the original V(x, y;r). This functional form for V therefore 
generates a most likely probability distribution that has finite derivatives to all orders and 
is generally more smooth than that from the first example. 

For the final example, consider the case in which the prior covariance operator is a 
projection operator that projects onto the subspace formed by functions having only Fourier 
wavenumbers smaller than a particular wavenumber ko- In one dimension this covariance 
operator is the Sine function, 

\rt i ^ sin [k {x - y)} 

V[x, y; k ) = r — . 16 

tt(x - y) 

Because this is a projection operator, V ■ V = V and from Eq. [15] U for this case is simply 
U(X) = V/(l + X). The most likely amplitude therefore consists of sums of Sine functions 
centered at each data point. Examples of the most likely probability distribution using 
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FIG. 3. The Fourier spectra of the three types of covariance operators shown in the legend. 
Free parameters in each case have been set to correspond roughly to a cutoff at wavenumber 
ko = 3.33. 



this prior distribution with = 3.33 are shown in Fig. g. The same data used for the 
examples in Fig. [I] were used here. Even with only 20 data points the most likely solution 
indicates a doubly peaked distribution. Both of the examples here are more smooth than 
those generated by the prior distribution discussed above in the first case and shown in 
Fig.0. 

It is useful to examine the Fourier spectrum of the prior covariance operator in order 
to understand some of the properties of the resulting most likely distribution. The Fourier 
spectra of the three covariance operators considered in the above examples are shown in 
Fig. [| Because of the form of the prior distribution (Eq. ^) those wavenumbers with larger 
Fourier amplitudes are more likely, a priori. However, in order to maximize the likelihood 
of the given data, the most likely amplitude will tend to consist of the largest possible 
wavenumber components. Because the Sine function covariance operator has the sharpest 
high wavenumber cutoff it will tend to generate the smoothest most likely distribution. 
Conversely, the inverse Laplacian covariance operator will tend to produce the least smooth 
most likely distribution. If the Sine function covariance operator is used, however, the cutoff 
k should be chosen with great care because this prior forbids any solutions containing 
wavenumbers higher than the cutoff. Thus if the chosen cutoff wavenumber is lower than the 
maximum wavenumber component of the target distribution then the most likely distribution 
will not converge to the target distribution as the number of data points increases. 
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